Title: Availability, reliability, and maintainability aspects of the SPERRY UNIVAC.
Author: Boone, LA ; Liebergot, HL ; Sedmak, RM
Affiliation: Sperry Univac, Blue Bell, PA, USA
Conference: 10th International Symposium on Fault-Tolerant Computing, 1-3 Oct. 1980, Kyoto, Japan
Sponsor: IEEE ; Inst. Electron. Commun. Eng. Japan
ISI Bibliography: N/A
ISI Times Cited: 0

Abstract: Describes the fault tolerant capabilities of the SPERRY UNIVAC 1100/60 Information Processing System, a recently
announced medium scale general purpose computer system. In the 1100/60, a variety of techniques is employed for
the detection/correction and isolation of, and recovery from, most single-bit hardware faults as well as many
multiple-bit faults. An approach for checking fault detection circuits is implemented using a comprehensive fault
injection system. A four level maintenance philosophy based around the built-in fault handling logic, scan network,
and intelligent support processor and console provides for rapid location and repair of the failing logic. (6 refs.)

Subject (Inspec): Computer architecture ; Fault tolerant computing ; General purpose computers ; Single bit hardware ; Multiple bit fault
Classification (Inspec): Computer architecture (C5220) ; Mainframes and minicomputers (C5420)
Publisher: New York, NY, USA : IEEE, 1980
Description: p.3-8
Doc. Type: Conference Publication
Treatment: New Development
Language: French
Control No.: 1658453



Copyright 2004: IEE 




AVAILABILITY, RELIABILITY, AND MAINTAINABILITY ASPECTS 


OF THE SPERRY UNIVAC 1100/60 


R.M. Sedmak, L.A. Boone, and H.L. Liebergot 


ABSTRACT 


The SPERRY UNIVAC 1100/60 information processing system incorporates a high level of fault tolerance
capability. A variety of techniques is employed for the detection/correction and isolation of, and recovery from,
most single-bit hardware faults as well as many multiple-bit faults. An approach for checking fault detection
circuits is implemented using a comprehensive fault injection system. A four-level maintenance philosophy
based around the built-in fault handling logic, SCAN network, and intelligent support processor and console
provides for rapid location and repair of the failing logic.


During development, substantial resources were devoted to assuring the quality of the 1100/60 design, 
recognizing that fault tolerance strongly complements, but is not a substitute for, design quality and correctness. 


An evaluation of the various fault handling features was carried out to provide measures of system availability, 
reliability, and maintainability. 


KEYWORDS 


fault tolerance, fault detection/correction, error recovery, availability, reliability, maintainability 


CONTACT AUTHOR 


R. M. Sedmak
Research and Technical Planning
Sperry Univac
M.S. A2-412
P.O. Box 500
Blue Bell, PA 19424
(215) 542-3638


Note 


This paper has been cleared by Sperry Univac for publication after announcement of the 1100/60. This draft is being 
submitted for review purposes only. The contents of this paper are proprietary to Sperry Univac and are not to be made public 
prior to publication. This paper is not to be reproduced. This draft should be returned to R. M. Sedmak after the review is 
completed. 



I. INTRODUCTION TO ARM


A. Definition


At Sperry Univac, the acronym ARM stands for availability, reliability, and maintainability. ARM concepts 
include: the organizational procedures used to develop systems; the tools and techniques used during 
design, development, and manufacturing; and the logic, firmware, and software that are included to 
minimize the effects of failure. 


Trends in the Computer Industry 


Certain general trends can be seen in the use of medium to high performance computer systems such as
the 1100/60, involving the type of processing that the computer is used for, and simultaneously, the
dependence of the computer owner on the system.


The initial use of commercial data processing machines in this performance range was batch processing, 
and this use continued for many years. It is characterized by job submission to a clerk or periodic 
scheduling of a job, and subsequent pickup of the output at some later time, typically hours after
submission. Failures of the system are generally hidden from the end user, although visible to the data 
processing department. The owner of the computer is using the system to provide reports that would take 
more people and more time to provide in a manual mode. 


The advent of good communication facilities led to several new developments. The first was remote batch 
processing, in which jobs can be submitted directly to the system from a remote location, and results can 
be returned at the direction of the remote user. In this mode, a system failure is visible to the user if the 
system is unavailable at the time transmission is desired.


Another development was demand processing. In this mode, the remote user remains connected to the 
system for the duration of the job. Typical examples of such use are complex engineering calculations, data 
base manipulation, and program development. In this mode, a failure of the system instantly affects all of 
the demand users who are connected to the system. This use of the system cannot be easily duplicated by 
manual means. The computer owner is directly affected by a system failure since the employees cannot 
perform their normal activities. 


A more sophisticated type of processing that is used frequently is called real-time processing. Rather than 
being job oriented, this mode is transaction oriented, with remote users entering information into and
retrieving it from a centralized data base, such as with airline reservations. Systems of this type may form the basic
structure of a business. Not only are all users instantly aware of a system outage, but the business may 
suffer instantaneous direct losses as a result of the inability to complete business transactions. Real-time 
processing systems require the highest system availability. 


The trends in data processing and dependence on the computer show the need for higher ARM 
requirements as the use of computers increases around the world. 


Trends in Technology 


Early computers used vacuum tubes as the main switching elements. These tubes had a high failure rate;
and since many of them were used in one computer, the computer itself often had a Mean Time Between
Failures of no more than 1/2 hour.


The advent of the transistor caused a dramatic increase in the reliability of the computer. Initially, the
transistor replaced the tube as the switching element, but the complexity of the computer did not change
dramatically. Thus, the higher reliability of the transistor was not offset by the need for a larger number of
them.


The recent trend toward more highly integrated circuits with the equivalent of thousands of transistors on 
one integrated circuit chip has paralleled a trend toward much more complex computers. Thus, the number 
of equivalent transistors in a computer has increased significantly, which tends to make a reliable
computer more difficult to design. In addition, testing the computer and identifying problems is more
difficult when dense chips are utilized.


Fortunately, other trends in the technology of computers allow some new approaches to the reliability 
problem to be followed economically. For example, frequently all of the circuits in an integrated circuit chip, 
or all of the space available on a printed circuit card, cannot be utilized because of limited input/output 
connections to the chip or card. The unused circuits or space can be devoted to error detection in many 
cases without significant additional cost, since only an error signal is required as an external indicator [1]. 
This type of approach has been followed in the 1100/60 and offsets the trend toward higher complexity. 


II. ARM PHILOSOPHY FOR 1100/60


A. ARM in Previous SPERRY UNIVAC 1100 Series Systems


The SPERRY UNIVAC 1108 [2], introduced in 1965, was the first Sperry Univac 1100 Series computer to 
offer a multiprogramming operating system and the first to offer multiprocessing configurations. These two 
capabilities reflect a general ARM approach that has been carried forward in other Sperry Univac systems:
the attempt to isolate a problem to either a particular job in the system or to a particular unit in the
configuration.


The success of this approach is determined largely by the error detection attributes of the system. For a 
typical 1108 system, error detection consists of parity in the main storage and processor general registers. 
For the successor 1100/10 and 1100/20 systems, this coverage was enhanced to include parity on the 
I/O channels and in some of the mass storage control units, while the main storage utilized an error
detection/correction code instead of parity. Maintenance on the central complex (instruction processor, 
input/output unit, and main storage) is performed using a built-in maintenance panel and diagnostic 
programs. 


The 1100/40 system has all of the previous error detection capabilities, and in addition, a maintenance
controller as an adjunct to the maintenance panel. The controller incorporates a SCAN COMPARE 
capability that allows operations in the processor to be examined in each clock cycle and compared to 
known correct data from a magnetic tape. Any difference can be used to indicate the location of the 
problem area. This was the first automation of the maintenance task for 1100 series systems. 


In the 1100/80, additional error detection is provided for the input/output unit, and a cache memory with
parity has been added. The maintenance controller is replaced with a maintenance processor - an 
intelligent unit that can write several of the registers in the central complex and read almost all of them. 
The customer engineer's interface to the system is via CRT rather than a maintenance panel, and the 
maintenance processor can function even if most of the central complex is disabled. 


ARM in the 1100/60 - General Approach 


A fundamental requirement for the 1100/60 was to produce a system whose ARM attributes matched the 
processing modes of the future with the technology of today. The price/performance goals allowed the
selection of proven ECL technology for the instruction processor (IP) and cache, TTL for the input/output
unit, and 16K MOS chips for the main storage. In addition, the IP is microprogrammed and uses four
bit-slice microprocessors to achieve a reduction in component count. Established design rules allow adequate
temperature, voltage, and timing margins. The basic unit processor and most expansion features are
air-cooled and packaged in one cabinet.


Error detection has been given increased emphasis in the 1100/60 as compared to previous systems. This 
provides protection from incorrect results, aids in system recovery and reconfiguration, and helps to isolate 
a failure to a replaceable unit. The 1100/60 uses duplication, coding techniques, and parity to provide 
error detection throughout the system. 


The main storage and the IP’s microcode control storage have error correction capabilities to allow error
recovery to be transparent to the user. The cache memory architecture allows recovery from an error by 
retrieving desired data from the main storage when an area in the cache is disabled. Extensive retry 
capabilities are provided to recover from transient or intermittent faults in the nonmemory areas of the 
central complex. Solid faults are handled by reconfiguration. 


III. DETAILED ARM IMPLEMENTATION


System Characteristics 


The 1100/60 instruction processor is microprogrammed and is based on a high usage of LSI
microprocessors. The amount of hardware required for the IP has been kept low by implementing a large
portion of the control functions in microcode. The net effect of such an approach is to replace the
traditional mass of random SSI/MSI gates with high-density LSI arithmetic and storage components. The
high throughput of the 1100/60 is achieved by the use of multiple microprocessors, deep overlap at the
microinstruction level, prefetch of instructions and operands, and other design techniques [3].


Figure 1 depicts a block diagram of the main data paths of the IP. Each 1100/60 macroinstruction and 
interrupt is performed by executing a series of microinstructions. The execution of each microinstruction 
consists of bringing data from the general register set, main storage, or other source through the shifter 
into the subprocessors. Here, using multiple microprocessor chips, an arithmetic or logical operation is 
performed, combining the data from the shifter with data from the local storages or the accumulators, and 
placing the results in the accumulators internal to the microprocessor chips. At this point, the data may be 
placed onto the main data bus; and from there, the data can be written into local storage, the general 
register set, or main storage. These data movements are controlled by a microinstruction that partitions the 
work between microprocessors, selects the various resources (registers, local storage, etc.) to be used, and
lays the groundwork for the selection of the next microinstruction. 


The 1100/60 central complex contains, in addition to the IP, an input/output unit (IOU), cache or storage
interface unit (SIU), main storage unit (MSU), and a system support processor (SSP). The functions of each
of these units are well known except for the SSP, a freestanding, intelligent processor that employs a CRT
and keyboard and serves as the system maintenance and operator consoles. In addition, this unit carries
out the functions of system partitioning, control storage loading, fault injection, and some assorted system
support tasks. Through an interrupt mechanism, the system software communicates with the SSP.


Fault Detection 


The philosophy in the 1100/60 is to detect 100 percent of all single-bit faults in the data path, and many
single faults in control logic. The rationale for such a philosophy is based on the principle of minimizing the
probability that a user’s data could be corrupted without the system detecting or correcting the erroneous
operation and signaling the operator. As a general rule, faults are detected in storage elements by the use
of parity codes, while greater redundancy is used for arithmetic and control circuitry. The various detection
circuits are strategically placed in an effort to achieve a high coverage in hardware such as storages, which
have the highest failure rate, and in areas such as the main data path, which has a high usage and where
a large impact on the system might be experienced if a failure occurred.


An example of fault detection through redundancy is shown in Figure 2. Each of the two 36-bit master
subprocessors is paired with a duplicate subprocessor that performs the same function on the same data
as the master subprocessor. During each microcycle, only one of the master subprocessors can drive the
main data bus; and when one of them is chosen to drive it, its duplicate drives a duplicate bus. At the end
of the cycle, a comparison is made between the data on the two buses; any discrepancy will cause the
operation to be interrupted.
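
The duplicate-and-compare scheme described above can be sketched in a few lines. This is a minimal illustration only: the class and function names are hypothetical, the subprocessor is modeled as a plain 36-bit adder, and the injected fault (a flipped bit 5) is an arbitrary assumption chosen to show a miscompare.

```python
from dataclasses import dataclass
from typing import Callable

MASK36 = (1 << 36) - 1  # the subprocessors in the paper are 36 bits wide


class MiscompareError(Exception):
    """Raised when the main bus and the duplicate bus disagree."""


@dataclass
class SubprocessorPair:
    """A master subprocessor and its duplicate (hypothetical model)."""
    master: Callable[[int, int], int]
    duplicate: Callable[[int, int], int]

    def microcycle(self, a: int, b: int) -> int:
        # Both units perform the same function on the same data.
        main_bus = self.master(a, b) & MASK36
        duplicate_bus = self.duplicate(a, b) & MASK36
        # End-of-cycle comparison: any discrepancy interrupts the operation.
        if main_bus != duplicate_bus:
            raise MiscompareError(
                f"main bus {main_bus:#o} != duplicate bus {duplicate_bus:#o}"
            )
        return main_bus


def add36(a: int, b: int) -> int:
    """Fault-free 36-bit add."""
    return (a + b) & MASK36


def faulty_add36(a: int, b: int) -> int:
    """Same add with bit 5 of the result flipped (assumed fault)."""
    return add36(a, b) ^ (1 << 5)
```

With `SubprocessorPair(add36, add36)` a microcycle returns the sum; with `SubprocessorPair(add36, faulty_add36)` it raises `MiscompareError`, regardless of which of the two units is the faulty one.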


An overview of the fault detection techniques used in various portions of the 1100/60 IP is given in Figure
1. The general register set, local storages, and the shifter input selector employ conventional parity; the
shifter and, as discussed above, the subprocessors (including the main data bus) are duplicated and
compared.
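
Conventional parity of the kind named above can be illustrated briefly. The sketch below assumes even parity over one stored word; the paper does not state the parity sense or how many bits share a parity bit, and the function names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1


def store(word: int) -> tuple:
    """Return the word together with the parity bit stored beside it."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """True when the word read back still matches its stored parity."""
    return parity_bit(word) == stored_parity
```

A single flipped bit always fails `check`, while two flipped bits in the same word pass it. This is one reason the text reserves greater redundancy for arithmetic and control circuitry.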


During the development of the 1100/60, an observation was made that is contrary to most claims about
fault detection and its associated performance impact. Many sections of the data path are duplicated and
compared to achieve fault detection. However, the duplicated logic serves another purpose: it provides
additional output drive capability (i.e., loading) for the functional circuits. For example, if a functional circuit
has 8 unit loads on its outputs, the addition of a duplicate circuit could reduce that number to 4 by splitting
the loads between the functional and duplicate circuits. A side benefit is a reduction in the propagation
delay time of the output stage of the functional circuit. In fact, as additional portions of the data path were
duplicated to detect faults, it was discovered that the combined effect could be an increase in performance
compared to a strictly simplex design. This result points out the fact that fault detection need not have an
adverse effect on performance.


Several factors have contributed to keeping costs low for the fault detection features: 


o Design approach of incorporating fault detection mostly in high-usage and high-failure-rate sections
of the logic.


o Incorporation of most of the control functions in microcode, which is stored in LSI storage
components where fault detection is very economical.


o Heavy use of microprocessors, which are well-ordered, bus-oriented logic structures that lend
themselves to conventional fault detection techniques.


o The philosophy of designing fault detection circuitry at the same time as functional logic, and
designing functional logic that lends itself to fault detection.


As a result of these, and other factors, the extensive fault detection mechanisms require only 11 percent of 
the total CPU logic. This figure translates to less than 11 percent of the reproduction costs attributable to 
the detection logic. 


Error Correction 


The application of error-correction techniques lies in two primary areas: the main storage unit (MSU) and 
the IP’s control storage. The MSU employs an error correction code to correct single-bit faults and detect 
double-bit faults in the storage array chips during every read reference. When a double-bit fault occurs, an 
error signal is sent to the requesting unit indicating that the data should not be used. 
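
The behavior described for the MSU (correct any single-bit fault, detect any double-bit fault and flag the data as unusable) is that of a single-error-correcting, double-error-detecting code. The paper does not specify the code or word width used in the 1100/60, so the sketch below uses a small extended Hamming (8,4) code purely as an illustration; all names are hypothetical.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    c[1..7] are the Hamming positions (1, 2, 4 hold check bits);
    c[0] is an overall parity bit that enables double-error detection.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) & 1
    return c


def decode(codeword: list) -> tuple:
    """Return (data, status); data is None when the word must not be used."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    overall = sum(c) & 1             # 0 when overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit fault: the syndrome is the failing position
        # (0 means the overall parity bit itself failed).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: double-bit fault.
        return None, "uncorrectable"
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return data, status
```

Returning `None` with the status `"uncorrectable"` plays the role of the error signal sent to the requesting unit on a double-bit fault.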


Error correction is also utilized in the microprocessors’ control storage, although a different approach is 
taken than in the main storage. When a single-bit fault occurs, the parity code stored with each 
microinstruction will permit the detection of the fault. An interrupt then allows the correction procedure to 
be initiated by the system support processor. After each attempt at correction, the SSP will read the data 
from the failed portion of the control storage to verify the proper correction. When proper correction is 
achieved, the SSP signals the IP to restart execution. 
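
The control storage procedure (detect by parity, correct under SSP control, read back to verify, then restart) can be sketched as follows. This assumes that the SSP rewrites the failing word from a known-good copy of the microcode, which is plausible because the SSP also performs control storage loading, but the paper does not say so explicitly. The retry limit and all names are hypothetical.

```python
def parity(word: int) -> int:
    """Even-parity bit for a word (parity sense is an assumption)."""
    return bin(word).count("1") & 1


class ParityError(Exception):
    """Raised when a control storage word fails its parity check."""


class ControlStore:
    """Control storage in which each word is held with a parity bit."""

    def __init__(self, image):
        self.words = [(w, parity(w)) for w in image]

    def read(self, addr: int) -> int:
        word, stored = self.words[addr]
        if parity(word) != stored:
            raise ParityError(addr)
        return word

    def write(self, addr: int, word: int) -> None:
        self.words[addr] = (word, parity(word))


def ssp_correct(store, reference_image, addr, max_attempts=3) -> bool:
    """Rewrite the failing word and verify it by reading it back.

    Returns True when correction is verified; at that point the SSP
    would signal the IP to restart execution.
    """
    for _ in range(max_attempts):
        store.write(addr, reference_image[addr])
        try:
            if store.read(addr) == reference_image[addr]:
                return True
        except ParityError:
            pass  # still failing; try again
    return False
```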


Fault Isolation 


There are two major techniques used for fault isolation in the 1100/60. One technique relies on the high 
level of coverage provided by the fault detection capabilities in the processor. As a further explanation, 
consider two black boxes that represent the two extremes of coverage. One box has two fault detection 
circuits; one on the input and one on the output. Assuming the detection circuits will catch any single-bit 
fault in the data they are monitoring, the combined effect of the two detection circuits would be to detect 
any single-bit fault within the black box and isolate the failure to the box, or to the previous box sending 
data to it. This small amount of fault detection provides an inadequate level of isolation to the customer 
engineer who has to find the replaceable unit inside the black box. The other extreme would be a black box 
in which each replaceable unit (such as an integrated circuit or printed circuit card) has fault detection
associated with it that could detect any single fault in that unit. Such a black box would have excellent
isolation capabilities because any single fault that occurred in the box would be found immediately and the
failed replaceable unit indicated. The 1100/60 was designed using the latter philosophy; thus, the sheer
magnitude of fault detection capability in the IP contributes dramatically to the location and isolation of
faults.


The other major technique employed in the 1100/60 for fault isolation is the diagnostic program, a tool
used after symptoms of the failure have surfaced. This approach is usually employed when the failure has
not been isolated sufficiently by the fault detection logic. Diagnostics are constructed in the form of
microroutines or macroroutines, which are run on the failing unit. Frequently, these diagnostic programs
make use of a tool in the 1100/60 known as SCAN COMPARE [4,5], which is a method of accessing the
states of major test points in the IP.


Utilizing the two methods, the capability exists to isolate automatically (i.e., without the need for manual
diagnosis) any failure in the main data path to one or two printed-circuit cards. The obvious values of such 
a feature are to reduce the time to repair the central complex (and hence reduce field support costs) and 
dramatically improve the availability of the system to the user. 


Error Recovery 


The basic assumption in the development of an integrated error recovery procedure is that the computer 
system must be able to deal with both solid and transient faults. When a solid fault occurs, any operation 
affected can produce incorrect results; and the same operation using the same data will always produce 
incorrect results. Detection and isolation of such a fault is relatively easy, but recovery from it without 
manual intervention and repair is frequently difficult, if not impossible, unless some type of error correction 
or masking capability exists in the system. When a transient fault occurs, the recovery attempt is often 
successful if the transient fault has disappeared. Thus, a transient fault is difficult to detect and isolate due 
to its transitory nature; but compared to recovery from a solid fault, the failing operation can be recovered 
more completely, and the system configuration suffers less degradation. 


In the case of any failure, the ideal situation would be for the system to be able to mask, or otherwise 
tolerate any solid or transient fault, and continue the operation through such techniques as redundancy of 
modules and error correction codes. However, present technology and economics prohibit the practical 
application of these capabilities throughout a commercially available system. The most cost-effective 
compromise determined for the 1100/60 was to correct as high a percentage of the faults as economically 
practical, and to establish a recovery procedure for as many of the remaining faults as possible. In spite of
this, a major difficulty still existed because the desired response for a solid fault differs from that of a
transient one. When a solid fault occurs, the approach is to correct it if possible or otherwise stop the unit
and repair the fault. If a transient fault occurs (some estimates indicate that 50 percent of all faults are
transient), the failed operation should be reexecuted after the system has been purged of the fault and its 
remnants. The problem when a fault is first detected is that there is no way to determine if it is solid or 
transient. To deal with this difficulty, a recovery procedure for the 1100/60 IP was created that assumes 
all faults are transient in nature on their first occurrence. When an error is detected and cannot be 
corrected or masked, a retry of the operation is carried out, if possible. If the retry is not possible, the 
operating system software is interrupted. If the retry or the fault recovery mechanism fails, the fault is 
assumed to be solid; correction or fault recovery continues under the control of the SSP. 
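
The recovery policy in the paragraph above can be written as a small decision procedure. This is a sketch under stated assumptions: the operation is modeled as a Python callable that raises when a fault is detected, a single retry is assumed (the paper does not give a retry count), and the outcome labels are hypothetical.

```python
class DetectedFault(Exception):
    """Raised by an operation when fault detection logic fires."""


def run_with_recovery(operation, retryable=True, max_retries=1):
    """Treat every fault as transient on its first occurrence."""
    try:
        return "ok", operation()
    except DetectedFault:
        pass
    if not retryable:
        # Retry is not possible: interrupt the operating system software.
        return "interrupt_os", None
    for _ in range(max_retries):
        try:
            return "recovered", operation()
        except DetectedFault:
            continue
    # The retry failed: assume the fault is solid and hand over to the SSP.
    return "solid_fault_to_ssp", None


def make_faulty(failures, value):
    """Build an operation that faults `failures` times, then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise DetectedFault()
        return value

    return operation
```

For example, `run_with_recovery(make_faulty(1, 5))` models a transient fault and is recovered by the retry, while an operation that keeps faulting is classified as solid.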


The same recovery philosophy is employed in each processor in an 1100/60-based multiprocessor system. 
However, should one processor become inoperative, the capability is provided to continue the operation on 
the balance of the system at a lower throughput rate. 


An interesting observation in this regard is that much of the hardware used in the implementation of the 
recovery mechanisms would have been needed for some of the normal operational functions (i.e., other 
than for ARM). The implication is that, frequently, portions of the hardware necessary for ARM features 
come virtually free as a result of already existing functions in the design, and the associated costs are thus 
much less than expected. 


Fault Injection 


In the 1100/60 processing system, a capability has been included that uses hardware and software to 
verify that the fault detection, isolation, and recovery mechanisms are operational. The capability is 
provided through fault injection, which refers to the process of causing a fault to occur in a system by 
inserting erroneous data or control signals in a portion of logic covered by a fault detection capability. The 
need for such a feature arises because the error handling portions of the design are not frequently 
exercised under normal operation of the system. Without a periodic verification of the integrity of the error 
handling logic, one could not be confident of its ability to function when needed. A side benefit of this 
capability is that it is quite useful during prototype testing in verifying the design of the hardware and 
software fault-handling capabilities. 


The 1100/60 system incorporates fault-injection for fault-detection circuits in the processor, input/output 
unit, storage interface unit, and main storage unit. In the IP, the injection of a fault is controlled by 
execution of microinstructions in which certain bits are set to initiate the process. In the case of the control 
storage or the small storage used for instruction decoding, the fault injection is under control of the SSP. 
For example, the SSP can inject a control storage parity error by writing incorrect parity directly into the 
desired address location in order to stimulate one of the associated parity checkers. 
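
The control storage example above (write deliberately incorrect parity, then confirm that the checker fires) can be sketched as follows. The storage model, the `checker_works` switch used to simulate a dead checker, and the test pattern are all hypothetical.

```python
def parity(word: int) -> int:
    """Even-parity bit for a word (parity sense is an assumption)."""
    return bin(word).count("1") & 1


class CheckedStore:
    """Storage with a parity checker that can itself be broken."""

    def __init__(self, size: int, checker_works: bool = True):
        self.cells = [(0, 0)] * size
        self.checker_works = checker_works

    def write(self, addr: int, word: int, force_bad_parity: bool = False):
        p = parity(word)
        if force_bad_parity:
            p ^= 1  # fault injection: store the wrong parity on purpose
        self.cells[addr] = (word, p)

    def read(self, addr: int):
        word, p = self.cells[addr]
        error = self.checker_works and parity(word) != p
        return word, error


def verify_checker(store: CheckedStore, addr: int, pattern: int = 0o525252) -> bool:
    """Inject a parity fault at addr and report whether it was detected."""
    saved, _ = store.read(addr)
    store.write(addr, pattern, force_bad_parity=True)
    _, error_flag = store.read(addr)
    store.write(addr, saved)  # restore the original contents
    return error_flag
```

A return value of `False` means the injected fault went unnoticed, that is, the detection circuit itself has failed.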


In the input/output unit, injection is also under the control of the SSP. For example, a fault can be injected 
in the channel control word storage registers by setting and clearing flip-flops that specify the type of fault 
desired and the device operation during which the fault should occur. After the proper logic has been 
primed for the injection, the chosen fault will be triggered the next time the preselected device operation 
takes place. 
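
The prime-then-trigger behavior described for the input/output unit can be modeled as a small state holder. The names are hypothetical, and the sketch assumes that an armed fault fires once and is then cleared, which the paper does not state.

```python
class InjectionControl:
    """Models the flip-flops that select a fault type and a device operation."""

    def __init__(self):
        self.armed = None

    def arm(self, fault_type: str, operation: str) -> None:
        """Prime the logic: which fault, and during which operation."""
        self.armed = (fault_type, operation)

    def clear(self) -> None:
        self.armed = None

    def on_operation(self, operation: str):
        """Return the fault to inject during this operation, or None."""
        if self.armed is not None and self.armed[1] == operation:
            fault_type = self.armed[0]
            self.armed = None  # one-shot behavior (assumption)
            return fault_type
        return None
```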


In the 1100/60 cache (SIU) and main storage unit (MSU), the injection process is controlled by an IP 
microcode routine. Forced faults internal to the SIU are specified by the unit requesting or sending data, 
and that requestor then expects a certain type of fault at a predetermined point in the operation. In the 
MSU, the microcode routine provides the capability to insert an invalid ECC code or bad parity on read data. 
On the access cycle of the MSU, the fault should be detected and an interrupt signaled. 



Maintenance 


The maintenance philosophy for the 1100/60 incorporates four methods of dealing with a failure in the 
system. In the order of priority of use, they are: 


1. Automatic error log 


2. Macrodiagnostic tests 
3. SCAN COMPARE tests 
4. Manual troubleshooting 


The automatic error log represents a record kept by the system of any fault handled by the built-in fault 
detection/correction, isolation, and recovery mechanisms. In the majority of cases, the log should provide 
sufficient information for the customer engineer to determine the source of the problem. 


In those cases when the problem has not been completely identified by the error log information, 
macrodiagnostic tests are employed. These tests are written on a macroinstruction level; they serve to 
exercise most portions of the system establishing either a high level of confidence that the system is 
operating correctly, or determining the general area in which it has failed. 


If the previous two techniques have not identified the trouble area or if a finer resolution of the failure is
needed, the SCAN COMPARE tests are run. These routines are tests run under the control of the SSP. They
make use of the previously mentioned SCAN network (built into the processor), permitting access to the
state of all major storage elements in the IP. The tests will exercise the processor and compare the results
to a table of predetermined correct results to establish the nature and source of the problem. A similar
process is applied to the IOU, which is a hardwired unit. I/O instructions are executed and the SCAN
network is employed under control of the SSP without the use of the IP.


Accessible through the SCAN network is a built-in logic analyzer which is available in each 1100/60 as an
additional diagnostic tool for the customer engineer. This feature permits automatic storage of 1024
consecutive states of any 16 logic points sampled twice during each microinstruction cycle. The logic
analyzer is helpful in the diagnosis of a particularly difficult failure mode, such as might exist in the
presence of an intermittent fault.


Should none of the three methods above locate the failure, the customer engineer will resort to manual 
troubleshooting techniques. These manual efforts, however, are greatly enhanced by the features available 
in the SSP. For example, the SCAN network will permit the troubleshooter to capture and display on the 
CRT a substantial amount of internal test-point information that in the past would only have been available 
by using such tools as oscilloscopes and logic probes. In addition, the SSP has a communications capability 
that allows a linkage with a remote maintenance facility staffed with a team of diagnostic experts. This 
feature provides the ability to collect from a remote location any error information (for example, the 
contents of the logic analyzer) that is normally gathered and analyzed onsite. In addition, this interface can 
be used for remote control of diagnostic execution and troubleshooting procedures.



IV. QUALITY. 


In the development of all computer products, it is recognized that fault detection/correction and handling 
techniques are not substitutes for a good basic design. Thus, the functional design must be correct for the 
system to be stable and available. The ARM features discussed previously are complementary to the good 
design practices mandatory for a reliable computer. 


To assure the inherent quality of the 1100/60, various techniques were employed, including worst-case 
design practices. In such an approach, the extreme limits of the component and packaging design 
parameters are used in the engineering design. For example, the propagation delay through a string of 
logic gates is always calculated by using the maximum delay (considering such factors as temperature, 
voltage, and aging) of each of the gates, pins, and wire runs or printed-circuit foil connections. By 
employing the worst-case design principles, the completed system is impervious to most environmental 
changes, as well as to internal electrical parameter variations. 
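
The worst-case propagation delay calculation described above amounts to summing the maximum delay of every element on a path and comparing the total with the available cycle time. The sketch below shows that arithmetic; the element names, delay values, and cycle time are illustrative placeholders, not 1100/60 figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One gate, pin, or wire run on a logic path (hypothetical model)."""
    name: str
    typ_ns: float   # typical delay
    max_ns: float   # maximum delay over temperature, voltage, and aging


def worst_case_delay(path) -> float:
    """Sum of the maximum delays of every element on the path."""
    return sum(e.max_ns for e in path)


def meets_timing(path, cycle_ns: float) -> bool:
    """True when even the worst-case path delay fits in the cycle."""
    return worst_case_delay(path) <= cycle_ns


# Illustrative values only.
example_path = [
    Element("gate_1", typ_ns=1.5, max_ns=2.0),
    Element("wire_run", typ_ns=1.0, max_ns=1.5),
    Element("connector_pin", typ_ns=0.25, max_ns=0.5),
]
```

Here the typical delays sum to 2.75 ns but the worst case is 4.0 ns, so the path passes with a 4.0 ns budget and fails with 3.5 ns, even though a typical-value analysis would have accepted it.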


As a means of verifying the tolerance of the system to reasonable voltage variations during operation, the 
1100/60 IP incorporates power supplies with output levels that can be varied by programmable margins 
under control of the SSP. This allows customer engineering personnel at the computer site, or at the
remote maintenance facility, to alter the output voltages easily and to observe the IP behavior. Hence, this 
margin testing tool provides one means of identifying marginally stable components before they have 
degraded to the level where a system fault may occur. 


A. Component Testing 


Before being used in the design of the processor, all components go through a qualification process in 
which the integrated circuit is thoroughly tested to determine if it meets the Sperry Univac specifications 
established for all components used in the company’s products. 


In addition, a high percentage of the incoming integrated circuit components used in the 1100/60 are 
tested individually before being placed on printed-circuit cards. The testing is at a functional level to ensure 
that the chips perform their desired actions. Some of the ECL chips are also tested dynamically to verify 
that the timing of the component is within the expected margins. Storage chips are given an especially 
rigorous test, which includes varying the voltage margins. The microprocessor chips are also tested for 
functional and timing attributes. After the printed circuit cards are assembled, extensive testing is carried 
out at the card, unit, and system levels. 


V. ARM EVALUATION 


To measure how the design goals were being met with respect to the ARM characteristics of the IP, an 
evaluation method was developed that facilitates a quantitative prediction of the ARM behavior of the 
central complex. Using this method, an estimation was made of the fault detection/recovery coverage, the 
system stability or mean time between stops (MTBS), the mean down-time (MDT), the mean time to repair 
(MTTR), and the availability of the central complex. 


The use of the evaluation method requires an initial examination of the various elements of the system and 
their anticipated contributions to the overall stability and availability of the central complex. After such an 
examination the two major ARM measures can be analyzed: MTBS and availability. MTBS is a measure of 
system stability and is calculated by evaluating its two components: hardware MTBS (MTBSH) and 
software MTBS (MTBSS). 
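The paper does not state how MTBSH and MTBSS are combined. One common approach, shown here only as a sketch, assumes that hardware and software stops occur independently at constant rates, so that the stop rates add. The function name and the sample figures are assumptions made for illustration.

```python
def combined_mtbs_hours(mtbs_hw_hours, mtbs_sw_hours):
    """Combine hardware and software MTBS, assuming independent constant stop rates.

    Under that assumption the overall stop rate is the sum of the two rates,
    so 1/MTBS = 1/MTBSH + 1/MTBSS.
    """
    if mtbs_hw_hours <= 0 or mtbs_sw_hours <= 0:
        raise ValueError("MTBS values must be positive")
    stop_rate_per_hour = 1.0 / mtbs_hw_hours + 1.0 / mtbs_sw_hours
    return 1.0 / stop_rate_per_hour


# Hypothetical inputs, not figures from the 1100/60 evaluation.
print(combined_mtbs_hours(1000.0, 1000.0))
```

Under this model the combined MTBS is always lower than the smaller of the two components, which is why both must be evaluated.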


The hardware stability is determined from component failure rates, the coverage of the fault detection 
mechanisms, and the recoverability of the system from each of the faults detected that do not lead to a 
system stop. The failure rates are obtained by studying vendor, government, and internal failure data for 
each of the integrated circuits and components used in the units. Coverage is analyzed by examining the 
current detailed hardware documentation for the system and determining which fault detection circuits will 
capture which faults, and in which chips. The recoverability factor is determined by studying what 
percentage of detected faults can be recovered from by each recovery mechanism and by analyzing the 
probability of success of that recovery. 
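As a rough illustration of how failure rate, coverage, and recoverability might combine into a hardware MTBS figure, the sketch below makes a simplifying assumption: any fault that is not both detected and successfully recovered is counted as a system stop. That rule, the class and function names, and the sample numbers are all assumptions for illustration and are not the evaluation method actually used for the 1100/60.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical per-component inputs to the stability estimate."""
    name: str
    failure_rate_per_hour: float  # from vendor, government, or internal data
    coverage: float               # probability that a fault is detected (0..1)
    recoverability: float         # probability that a detected fault is recovered (0..1)


def hardware_mtbs_hours(components):
    """Estimate hardware MTBS as the reciprocal of the total stop rate.

    Assumption: a fault causes a stop unless it is detected AND recovered,
    so each component contributes rate * (1 - coverage * recoverability).
    """
    stop_rate = sum(
        c.failure_rate_per_hour * (1.0 - c.coverage * c.recoverability)
        for c in components
    )
    if stop_rate == 0.0:
        return float("inf")  # no fault ever leads to a stop in this model
    return 1.0 / stop_rate


# Illustrative values only.
units = [
    Component("storage chip", 0.001, coverage=1.0, recoverability=0.5),
    Component("logic chip", 0.0005, coverage=0.0, recoverability=0.0),
]
print(hardware_mtbs_hours(units))
```

In this model, raising either coverage or recoverability for a component reduces its contribution to the stop rate, which matches the way the text treats the two factors as separate inputs.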


The software stability is calculated by examining the inherent characteristics of the modules, such as size 
and complexity; considering the quality assurance during development; and evaluating the environment in 
which the software will be used based on our experience with the stability of similar software systems in 
the past. 


The other major ARM factor, system availability, is predicted by considering the MTBFs and MTTRs of the 
various units in the central complex, the amount of redundancy of units in the system, and the recovery 
time necessary following a system stop. The application of the evaluation method described above proved 
to be a valuable tool for management and design personnel in gaining visibility of the unfolding ARM 
characteristics of the 1100/60 CPU versus the established goals for development. 
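The standard steady-state relations between MTBF, MTTR, redundancy, and availability can be sketched as follows. The sketch assumes independent unit failures and omits the recovery time following a system stop; the function names and sample figures are hypothetical and are not drawn from the 1100/60 evaluation.

```python
from math import prod


def unit_availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one unit: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(availability, count):
    """Availability of `count` identical units when any one of them suffices.

    Assumes the units fail independently of one another.
    """
    return 1.0 - (1.0 - availability) ** count


def series_availability(availabilities):
    """Availability of a chain in which every element must be up."""
    return prod(availabilities)


# Hypothetical central complex: one processor plus a duplicated storage unit.
processor = unit_availability(mtbf_hours=99.0, mttr_hours=1.0)
storage_pair = redundant_availability(unit_availability(9.0, 1.0), count=2)
print(series_availability([processor, storage_pair]))
```

The series product shows why the least available unit dominates the result, and the redundancy term shows how duplicating that unit recovers most of the loss.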


VI. SUMMARY 


As the applications of computers become more complex and sophisticated, and as new electronic 
technologies emerge in the industry, the demand, as well as the potential, for increased availability, 
reliability, and maintainability appears to be growing. The SPERRY UNIVAC 1100/60 information 
processing system reflects the results of a coherent development effort that takes advantage of the current 
state of the art in achieving a high level of inherent quality of design and a dramatic increase in the fault 
tolerant attributes of the commercial computer system. 


BIBLIOGRAPHY 


1. Sedmak, R. M., H. L. Liebergot, “Fault Tolerance of a General Purpose Computer Implemented by Very 
Large Scale Integration”, Digest of Papers, FTCS-8, The Eighth Annual Conference on Fault Tolerant 
Computing, Toulouse, France, pp. 137-143, June 1978. 


2. Borgerson, B. R., M. L. Hanson, and P. A. Hartley, “The Evolution of the Sperry Univac 1100 Series: A 
History, Analysis, and Projection”, CACM, pp. 25-43, January 1978. 


3. Boone, L. A., et al., “The Microarchitecture of the Sperry Univac 1100/60”, to be published. 


4. Stewart, J. H., “Future Testing of Large LSI Circuit Cards”, Digest of Papers of the 1977 Semiconductor 
Test Symposium, Cherry Hill, N. J., pp. 6-15, October 1977. 


5. Stewart, J. H., “Application of SCAN SET for Error Detection and Diagnostics”, Digest of Papers of the 
1978 Semiconductor Test Conference, Cherry Hill, N. J., pp. 152-158, October 1978. 


[Figure 1 block labels: TO MAIN STORAGE; MAIN DATA BUS; SUBPROCESSOR 1; SUBPROCESSOR 2; GENERAL 
REGISTER SET; LOCAL STORAGE 1; SHIFTER; LOCAL STORAGE 2; ERROR STATUS REGISTER; SHIFTER INPUT 
SELECTOR; FROM MAIN STORAGE] 


Figure 1. Block Diagram of the 1100/60 MICROEXECUTION SECTION, Showing Sample Applications of Fault Detection 
Techniques 


[Figure 2 block labels: TO FAULT HANDLING LOGIC; COMPARATOR; DUPLICATE DATA BUS; MASTER 
SUBPROCESSOR 1; DUPLICATE SUBPROCESSOR 1; MASTER SUBPROCESSOR 2; DUPLICATE SUBPROCESSOR 2] 


Figure 2. Example of Fault Detection through Duplication in the 1100/60 IP 


